Skip to content

[maintenance] diagnose long response latency #1421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pcrespov
Copy link
Member

@pcrespov pcrespov commented Mar 31, 2020

What do these changes do?

Monitors webserver's traffic and diagnoses when the average latency of the last requests is over a given threshold. At that point, the diagnostics raises HeathError which flags the service as unhealthy (i.e. healthy entrypoint returns 503). The swarm healthcheck mechanism can decide to restart it.

  • /v0/health is new healthcheck entrypoint. The previous entrypoint, /v0/, is kept as a run check.

  • New webserver module: diagnostics: takes over the service monitoring, healthcheck & diagnostics entrypoints and its role is to monitor, diagnose and flag the state of the service.

    • Tried new module file layout based on experience with fastapi: plugin (for setup), core (most basic func and options), entrypoints (handlers and routing)
    • Minor changes in config implies too many code changes: need to try pydantic approach and replace current trafaret's!
  • testing fixtures that emulate some of the situations found in actual deployed environs (see services/web/server/tests/unit/test_healthcheck.py)

  • Various small improvements in servicelib (i.e. Many TODO annotations not just deleted and actually implemented ;-) ) [see commits and file changes for details]

  • Some fixes in Makefile for recipes with target

  • Fixed issue with os.makedirs failure in development mode (@ignapas pls check in your OS)

Related issue number

Supports #1426

How to test

cd services/web
make devenv install-dev
make tests-unit

Checklist

  • Did you change any service's API? Then make sure to bundle document and upgrade version (make openapi-specs, git commit ... and then make version-*)
  • Unit tests for the changes exist
  • Runs in the swarm
  • Documentation reflects the changes
  • New module? Add your github username to .github/CODEOWNERS

@pcrespov pcrespov added this to the Dim Sum milestone Mar 31, 2020
@pcrespov pcrespov self-assigned this Mar 31, 2020
@codecov
Copy link

codecov bot commented Mar 31, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@a7fac9b). Click here to learn what that means.
The diff coverage is 82.25%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1421   +/-   ##
=========================================
  Coverage          ?   71.44%           
=========================================
  Files             ?      240           
  Lines             ?     9512           
  Branches          ?     1044           
=========================================
  Hits              ?     6796           
  Misses            ?     2412           
  Partials          ?      304
Flag Coverage Δ
#integrationtests 58.53% <41.08%> (?)
#unittests 62.58% <82.25%> (?)
Impacted Files Coverage Δ
...ervice-library/src/servicelib/application_setup.py 81.08% <ø> (ø)
...rc/simcore_service_webserver/application_config.py 100% <ø> (ø)
...kages/service-library/src/servicelib/monitoring.py 0% <0%> (ø)
...erver/src/simcore_service_webserver/application.py 94.73% <100%> (ø)
...erver/src/simcore_service_webserver/rest_routes.py 100% <100%> (ø)
packages/service-library/src/servicelib/utils.py 67.56% <100%> (ø)
...ver/src/simcore_service_webserver/rest_handlers.py 100% <100%> (ø)
...rc/simcore_service_webserver/diagnostics_plugin.py 100% <100%> (ø)
...s/service-library/src/servicelib/rest_responses.py 80% <50%> (ø)
...s/web/server/src/simcore_service_webserver/rest.py 89.58% <66.66%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7fac9b...96fdc6a. Read the comment docs.

@pcrespov pcrespov force-pushed the maintenance/diagnose-cancellations branch 2 times, most recently from 30f5e89 to b76fb44 Compare April 3, 2020 06:47
@pcrespov pcrespov marked this pull request as ready for review April 3, 2020 07:43
@pcrespov pcrespov changed the title WIP: [maintenance] diagnose task cancellations [maintenance] diagnose long response latency Apr 3, 2020
@pcrespov pcrespov force-pushed the maintenance/diagnose-cancellations branch from bdbeb84 to 96fdc6a Compare April 3, 2020 09:11
Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

@pcrespov pcrespov merged commit a957016 into ITISFoundation:master Apr 3, 2020
@pcrespov pcrespov deleted the maintenance/diagnose-cancellations branch April 3, 2020 11:15
@sanderegg sanderegg linked an issue Apr 6, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

platform stability
3 participants